In this part of the workshop, we’ll take the filtered/processed data from part 1 and continue with some standard downstream analyses. Beyond clustering, this part is largely just a bunch of snippets of different things you can do with the data
library(Seurat)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(fgsea)
library(dorothea)
library(progeny)
We previously saved the data as an RDS file in our output/ subfolder.
seurat <- readRDS("../output/pbmc_seurat_filtered.rds")
Let’s remind ourselves of what the data looked like
DimPlot(seurat)
Important note: tSNE and UMAP are not clustering algorithms!
The semantics are important here. While tSNE/UMAP move similar data points close together, it is not defining discrete groups. With completely different cell types, you can visually tell how cells will get segregated, but we want to use an unsupervised approach where we can.
The most common clustering strategy is a graph-based clustering method called Louvain clustering. These methods require the data to be represented as a nearest-neighbor graph (ie. for every cell, find the closest k cells). The clustering algorithms are usually based on modularity metrics–ie. how can we group nodes of the network such that nodes within a cluster are much more highly connected that nodes between clusters.
First, we construct the graph on PC space
seurat <- FindNeighbors(seurat, dims=1:30)
## Computing nearest neighbor graph
## Computing SNN
Now we cluster the resultant graph.
NOTE: When clustering, there is almost always a user-defined parameter to tweak the resolution of the clustering. This relates to how many/few clusters there end up being. This will depend on the types of questions you want to ask with your data. If you’re looking for subtle cell state differences within, for example NK cells, it will take a high resolution to split the single NK population into several. However, if you’re just interested in getting a general idea about if a certain cell type is present in your tissue, you don’t need high resolution clustering to check this. For 8k cells, a resolution setting of 0.2 is usually a good starting point
seurat <- FindClusters(seurat, resolution=0.2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 8337
## Number of edges: 351972
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9546
## Number of communities: 12
## Elapsed time: 1 seconds
Now when you run DimPlot, Seurat will show you clusters by default
DimPlot(seurat, label=T)
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
You can see that the populations that we broadly identified as monocytes (Clusters 0,8,9) and T cells (1,3,4,5,6) have formed multiple clusters. This likely corresponds to specialized states of these cell types. If my questions were more broad (eg. “How many T cells are in the blood following infection?”), I may be interested in reducing the resolution to an appropriate level.
Fow now, let’s proceed with it as it is, but please explore how this changes the resulting clusters.
This isn’t necessarily part of the analysis pipeline, but you may be interested in the relative frequency of each cell type. This could be informative when comparing samples from different patients/animals, or disease vs. control.
This code uses a very common data wrangling package called “dplyr”–learning this is a very good skill to have. A PDF cheatsheet of dplyr functions can be found (here)[https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf]
cluster_frq <- seurat@meta.data %>%
group_by(seurat_clusters) %>%
summarise(count=n()) %>%
mutate(relative_freq = count/sum(count)) %>%
mutate(data_set = "PBMC")
And here are the number of cells per cluster and the relative frequency
cluster_frq
We’ll plot a simple stacked bar chart of this using the popular plotting package “ggplot2”. It is incredibly flexible and for the most part, you can make plots look like whatever you want with it. This is the plotting package Seurat actually uses under the hood for its plotting functions. It would be very good to learn.
The plotting scripts are very modular. It can be simple, essentially just providing the data and telling it what type of plot you want – geom_col() for a barchart here
freq_plot <- ggplot(cluster_frq, aes(x = data_set, y = relative_freq, fill=seurat_clusters)) +
geom_col()
freq_plot
But then we can modify lots of different components of the plot by adding extra options to the plotting script
freq_plot <- ggplot(cluster_frq, aes(x = data_set, y = relative_freq)) +
geom_col(color="black", aes(fill=seurat_clusters), position = position_stack(reverse = TRUE)) +
ylab("Relative Frequency") +
scale_y_continuous(expand=c(0,0)) +
theme_classic() +
coord_flip() +
theme(legend.title = element_blank(),
axis.text.y=element_text(size=10, color='black'),
axis.text.x=element_blank(),
axis.title=element_blank(),
axis.ticks.x=element_blank(),
axis.line = element_blank())
freq_plot
Now that we’ve clustered the data, the next question is usually “well, which cluster is which cell type?”. In some cases, like with PBMCs, we know a lot about specific genes expressed in different cell types. In these cases, you could just annotate it based on the expression of known markers. Instead, we’ll use differential expression as a tool to identify genes that mark each cluster.
The idea here is that we’ll iterate through each cluster, running differential expression between the cells of that cluster and the aggregate of all other cells. While this may seem weird, the idea of this is to identiy genes that are uniquely expressed in a given cluster, so the ideal gene would be lowly-to-not expressed across the aggregate.
I’ll discuss a more focused differential expression down below
Depending on the options you set for this function, it can be the part of the script that takes the longest. I’ll set two options here that speed it up a bit. These settings are usually pretty good for identifying marking
cluster_markers <- FindAllMarkers(seurat,
logfc.threshold = 0.5, #To test all possible genes (very slow), set this to 0
only.pos = T) #and set this to F
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
And lets look at the top markers (by logFC) for each cluster. You could click the “cluster_markers” variable in your Environment window to open up the entire table, but there may be too many to scroll through. Again, here’s a bit of dplyr code to wrangle that table and show us the top 10 genes for each.
cluster_markers %>%
group_by(cluster) %>%
top_n(10, avg_logFC)
-p_val is the p-value from the Wilcoxon rank sum test -avg_logFC is the log fold change between the cluster (see cluster column) and the aggregate of others -pct.1 is the percentage of cells in the cluster with detectable expression of the gene -pct.2 is the percentage of cells in the aggregate of other clusters that express it -p_val_adj is the Benjamini-Hochberg-adjusted p-value -cluster is the cluster being tested -gene is the gene
Let’s plot the top gene for every cluster
top_genes <- cluster_markers %>%
group_by(cluster) %>%
top_n(1, avg_logFC) %>%
pull(gene)
FeaturePlot(seurat, features=top_genes, cols = c("lightgrey", "red"), pt.size=0.25,
combine = T)
By going through these markers and their expression patter, it’s usually fairly straight forward to annotate each cluster. I’m not an immunologist, so I won’t get into talking this too much.
Seurat has a few other easy tools for visualization, I’ll some some here.
Violin plots show the distribution of expression values for a gene across each cluster
VlnPlot(seurat, features="CCL5", pt.size=0.25)
Shows the distribution along the x-axis
RidgePlot(seurat, features="CCL5")
## Picking joint bandwidth of 0.162
Seurat has an option to make a heatmap of expression values, organized by cluster. Honestly, I’m not a huge fan of it, but I’ll show it anyway. You can find an alternative heatmap function in Part 1.
Usually for heatmaps of expression data, you will Z-score the expression values so that mean expression of every gene is 0. Otherwise, the only patterns you’ll really see are which genes are expressed highly.
Let’s pick some genes to plot. We’ll do the top 5 marker genes per cluster.
top_genes <- cluster_markers %>%
group_by(cluster) %>%
top_n(5, avg_logFC) %>%
pull(gene) %>%
unique() #in case any markers are duplicated
DoHeatmap(seurat, features=top_genes)
Similar to a heatmap, but it encodes two features per gene: The scaled expression and the percentage of cells in the cluster expressing the gene. By default, the gene names will be along the x-axis, written horizontally. They will all overlap eachother and it will look terribly. Like I mentioned above, plotting in Seurat is done with ggplot2. I’ll cheat a little and inject some ggplot2 code on top of the Seurat script
DotPlot(seurat, features=top_genes, cols=c('lightgrey', 'red')) +
theme(axis.text = element_text(angle=45, hjust=1))
We had only really done differential expression to find markers of clusters using the FindAllMarkers() function. Seurat also has a FindMarkers() that is specifically for comparing two groups. Although it’s called FindMarkers it can be used as a general differential expression tool for a pairwise comparison. It uses the Wilcoxon rank sum test, which was one of the best tests in a recent benchmark paper.
If you are dealing with a dataset containing two groups, you can use this as a convenient differential expression function.
Its basic usage is to just compare two clusters in a seurat object. Let’s compare Memory T cells (Cluster 1) with Naive T cells (Cluster 3)
dge_tcells <- FindMarkers(seurat, ident.1 = 1, ident.2 = 3,
only.pos=F,
logfc.threshold = 0) # we'll test all genes
NOTE: There are other options you can set. For example, if you have an experiment with multiple datasets and you want to compare between conditions, but within a cluster (let’s say arbitrarily cluster 1), you can run something like: FindMarkers(seurat, ident.1=“Disease”, ident.2=“Control”, group.by=“Sample”, subset.ident=1)
NOTE 2: If you are dealing with data that has multiple samples and >1 condition, this approach kind of falls apart as it’s a little naive and doesn’t use replicate info. Check out a package like muscat for this.
Okay, our test finished running. If you click the dge_tcells variable in your Environment window, it will pop up and you can explore it a bit.
GSEA is a convenient tool for looking at gene sets that are enriched at either extreme of a ranked list of genes.
Typical GO Term enrichment will just take a list of genes and use a statistical test to see if it is enriched with GO terms. This is fine and dandy, but it gets a little awkward when you have both up- and down-regulated genes. What if a GO term shows up in both? Omit it? But what if it’s more enriched in one than the other?
GSEA kind of gets around this issue because you can rank your results by fold change and the test will assess if gene sets (including GO terms) are statistically enriched in up or downregulated genes in one test.
Before we try out GSEA, we need to import the gene sets. A variety can be found at the Broad Institute’s Molecular Signatures Database (MSigDB)[https://www.gsea-msigdb.org/gsea/msigdb/index.jsp]. I’ve included a few key collections in this github repo that we’ll load in. We’ll work with all GO terms, KEGG pathways, Reactome pathways, and MSigDB Hallmark gene sets.
hallmarks <- fgsea::gmtPathways("../data/hallmark.genesets.v6.1.symbols.gmt") #50 gene sets
kegg <- fgsea::gmtPathways("../data/kegg.genesets.v6.1.symbols.gmt") #186
go <- fgsea::gmtPathways("../data/GOTerms.BP.v6.1.symbols.gmt") #4436
reactome <- fgsea::gmtPathways("../data/reactome.genesets.v6.1.symbols.gmt") #674
gene_sets <- c(hallmarks, kegg, go, reactome)
Here’s an example of one
gene_sets[1]
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [1] "JUNB" "CXCL2" "ATF3" "NFKBIA" "TNFAIP3" "PTGS2"
## [7] "CXCL1" "IER3" "CD83" "CCL20" "CXCL3" "MAFF"
## [13] "NFKB2" "TNFAIP2" "HBEGF" "KLF6" "BIRC3" "PLAUR"
## [19] "ZFP36" "ICAM1" "JUN" "EGR3" "IL1B" "BCL2A1"
## [25] "PPP1R15A" "ZC3H12A" "SOD2" "NR4A2" "IL1A" "RELB"
## [31] "TRAF1" "BTG2" "DUSP1" "MAP3K8" "ETS2" "F3"
## [37] "SDC4" "EGR1" "IL6" "TNF" "KDM6B" "NFKB1"
## [43] "LIF" "PTX3" "FOSL1" "NR4A1" "JAG1" "CCL4"
## [49] "GCH1" "CCL2" "RCAN1" "DUSP2" "EHD1" "IER2"
## [55] "REL" "CFLAR" "RIPK2" "NFKBIE" "NR4A3" "PHLDA1"
## [61] "IER5" "TNFSF9" "GEM" "GADD45A" "CXCL10" "PLK2"
## [67] "BHLHE40" "EGR2" "SOCS3" "SLC2A6" "PTGER4" "DUSP5"
## [73] "SERPINB2" "NFIL3" "SERPINE1" "TRIB1" "TIPARP" "RELA"
## [79] "BIRC2" "CXCL6" "LITAF" "TNFAIP6" "CD44" "INHBA"
## [85] "PLAU" "MYC" "TNFRSF9" "SGK1" "TNIP1" "NAMPT"
## [91] "FOSL2" "PNRC1" "ID2" "CD69" "IL7R" "EFNA1"
## [97] "PHLDA2" "PFKFB3" "CCL5" "YRDC" "IFNGR2" "SQSTM1"
## [103] "BTG3" "GADD45B" "KYNU" "G0S2" "BTG1" "MCL1"
## [109] "VEGFA" "MAP2K3" "CDKN1A" "CYR61" "TANK" "IFIT2"
## [115] "IL18" "TUBB2A" "IRF1" "FOS" "OLR1" "RHOB"
## [121] "AREG" "NINJ1" "ZBTB10" "PPAP2B" "KLF4" "CXCL11"
## [127] "SAT1" "CSF1" "GPR183" "PMEPA1" "PTPRE" "TLR2"
## [133] "CXCR7" "KLF10" "MARCKS" "LAMB3" "CEBPB" "TRIP10"
## [139] "F2RL1" "KLF9" "LDLR" "TGIF1" "RNF19B" "DRAM1"
## [145] "B4GALT1" "DNAJB4" "CSF2" "PDE4B" "SNN" "PLEK"
## [151] "STAT5A" "DENND5A" "CCND1" "DDX58" "SPHK1" "CD80"
## [157] "TNFAIP8" "CCNL1" "FUT4" "CCRL2" "SPSB1" "TSC22D1"
## [163] "B4GALT5" "SIK1" "CLCF1" "NFE2L2" "FOSB" "PER1"
## [169] "NFAT5" "ATP2B1" "IL12B" "IL6ST" "SLC16A6" "ABCA1"
## [175] "HES1" "BCL6" "IRS2" "SLC2A3" "CEBPD" "IL23A"
## [181] "SMAD3" "TAP1" "MSC" "IFIH1" "IL15RA" "TNIP2"
## [187] "BCL3" "PANX1" "FJX1" "EDN1" "EIF1" "BMP2"
## [193] "DUSP4" "PDLIM5" "ICOSLG" "GFPT2" "KLF2" "TNC"
## [199] "SERPINB8" "MXD1"
We will need to organize our genes and rank them by logFC. Here’s just a little wrangling
dge_tcells$gene <- rownames(dge_tcells)
dge_tcells <- dge_tcells %>% arrange(desc(avg_logFC))
fold_changes <- dge_tcells$avg_logFC
names(fold_changes) <- dge_tcells$gene
head(fold_changes)
## CCR7 FHIT LEF1 SOX4 ACTN1 CD7
## 0.8597330 0.6751992 0.5730427 0.5323364 0.4704149 0.4449481
Now, let’s run GSEA, looking for genesets that are upregulated or downregualted comparing memory vs. naive T cells
gsea_tcell <- fgsea(pathways = gene_sets,
stats = fold_changes,
minSize=5,
maxSize=5000,
nproc = 2)
## Warning in fgseaMultilevel(...): There were 6 pathways for which P-values were
## not calculated properly due to unbalanced (positive and negative) gene-level
## statistic values.
## Warning in fgseaMultilevel(...): For some of the pathways the P-values were
## likely overestimated. For such pathways log2err is set to NA.
## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.
gsea_sig <- filter(gsea_tcell, padj <= 0.05) %>%
arrange(NES)
head(gsea_sig)
Feel free to open the table and explore it a bit more.
Seurat has a convenient function for scoring each cell for a reference geneset, which may allow us to get a general idea of gene set activity across a population of cells. The score corresponds to the "average expression level of each program, subtracted by the aggregated expression of a control feature set (see ?AddModuleScore). In practice, I wouldn’t read too much into the specific values, but would focus on the relative changes between cells
Here’s an example using an interferon gamma response gene set, which is apparently down in our memory T cells (Cluster 1) relative to naive T cells (Cluster 3)
seurat <- AddModuleScore(seurat, features=gene_sets["GO_CELLULAR_RESPONSE_TO_INTERFERON_GAMMA"],
name="IFNG_Response")
## Warning: The following features are not present in the object: HLA-H, CCL17,
## HLA-DRB3, HLA-DQB2, CCL21, CCL3L1, STAR, CCL26, IL12B, HLA-DRB4, CAMK2B, TDGF1,
## WNT5A, CCL14, CCL8, IRG1, CCL22, CCL13, ASS1, CCL15, NOS2, CAMK2A, CCL1, MID1,
## CCL18, CCL11, AQP4, CCL7, CCL24, not searching for symbol synonyms
Now, this module score is actually stored in our metadata table
hist(seurat$IFNG_Response1, breaks=50)
And we can use standard visualization tools to look at the score in our data
FeaturePlot(seurat, features="IFNG_Response1", cols=c('lightgrey', 'red'), order=T)
VlnPlot(seurat, features="IFNG_Response1", pt.size=0.1)
While there is a lot of variability between other clusters, we see that the score is at least modestly higher in cluster 3 than in cluster 1. We can see it a bit better if we look at just those clusters:
VlnPlot(seurat, features="IFNG_Response1", idents=c(1,3))
This is a tool I recently started playing with that I love. Many people are interested in which signalling pathways may be affected in their data and end up resorting to GO term or KEGG pathway enrichment. While this is okay, one problem is that most of the genesets associated with signalling pathways refer to the signalling proteins themselves (the kinases) rather than the downstream targets of the pathway.
PROGENy is a tool that developed a database of consensus target genes of 14 signalling pathways. It takes an expression matrix and scores each sample an an activity level for each pathway.
exp_mat <- as.matrix(seurat[["RNA"]]@data)
pathways <- progeny(exp_mat,
scale=T,
organism="Human")
head(pathways)
## Androgen EGFR Estrogen Hypoxia JAK-STAT
## AAACCTGAGCATCATC-1 0.3739243 -0.4210440 -0.78385318 -1.1249665 0.67011656
## AAACCTGAGCTAACTC-1 2.7119002 0.2909708 -0.03740632 -1.8433863 -1.71976461
## AAACCTGAGCTAGTGG-1 -0.5505979 0.8010731 -1.29041327 -0.1874210 -0.39791280
## AAACCTGCACATTAGC-1 -0.3145123 -0.7765205 0.47579680 -0.6871366 -0.57441065
## AAACCTGCACTGTTAG-1 -0.7444938 -0.6553544 -0.71494058 0.3190308 0.09406944
## AAACCTGCATAGTAAG-1 -1.0022811 1.9276188 1.03985549 0.8439905 -0.67957704
## MAPK NFkB p53 PI3K TGFb
## AAACCTGAGCATCATC-1 -1.20917331 -0.57688702 0.07485003 -1.0042793 -0.3322471
## AAACCTGAGCTAACTC-1 -0.08691169 2.53382293 1.95327039 0.9286018 -0.8537470
## AAACCTGAGCTAGTGG-1 0.37064068 0.60223197 -0.69110332 -0.4516438 -0.8537470
## AAACCTGCACATTAGC-1 -0.06118787 -0.57720013 0.09310605 0.6141208 -0.8537470
## AAACCTGCACTGTTAG-1 0.11030241 -0.05813758 1.20997367 0.8426313 0.4057307
## AAACCTGCATAGTAAG-1 1.57816952 2.73578623 0.60775587 1.3185506 2.1414269
## TNFa Trail VEGF WNT
## AAACCTGAGCATCATC-1 -0.5247113 0.33435896 -2.2811348 -0.89091415
## AAACCTGAGCTAACTC-1 1.9345054 0.45856154 0.4521791 0.09653278
## AAACCTGAGCTAGTGG-1 0.9729998 -1.53629839 0.4521791 -0.57886926
## AAACCTGCACATTAGC-1 -1.1875851 0.98301596 0.4521791 -0.49894987
## AAACCTGCACTGTTAG-1 -0.1982560 0.24193442 -1.7321771 0.29473206
## AAACCTGCATAGTAAG-1 2.7259701 -0.02727536 0.5453691 0.18776337
We can add all of this data into our metadata
seurat@meta.data <- cbind(seurat@meta.data, pathways)
And visualize activities. I’ll just get a list of the different pathways from the column names of that “pathways” variable
RidgePlot(seurat, features = colnames(pathways), ncol = 4)
## Picking joint bandwidth of 0.238
## Picking joint bandwidth of 0.221
## Picking joint bandwidth of 0.258
## Picking joint bandwidth of 0.224
## Picking joint bandwidth of 0.198
## Picking joint bandwidth of 0.234
## Picking joint bandwidth of 0.179
## Picking joint bandwidth of 0.233
## Picking joint bandwidth of 0.246
## Picking joint bandwidth of 0.261
## Picking joint bandwidth of 0.18
## Picking joint bandwidth of 0.227
## Picking joint bandwidth of 0.229
## Picking joint bandwidth of 0.258
Hard to see unless you stretch the image out.
FeaturePlot(seurat, features="TNFa", cols=c('lightgrey', 'red'), order=T)
Perhaps expected that monocytes have the higehst TNF activity